Adaptive detection of missed text areas in OCR outputs: application to the automatic assessment of OCR quality in mass digitization projects
Abstract
The French National Library (BnF) has launched many mass digitization projects in order to give access to its collections. Digital documents on Gallica (the BnF's digital library) are indexed through their textual content, which is obtained by service providers using Optical Character Recognition (OCR) software. OCR software has become an increasingly complex system composed of several subsystems dedicated to the analysis and recognition of the elements in a page. However, the reliability of these systems remains an open issue: in some cases, errors appear in OCR outputs because of an accumulation of errors at different levels of the OCR process. One of the most frequent errors in OCR outputs is missed text components, whose presence may lead to severe defects in digital libraries. In this paper, we investigate the detection of missed text components in order to control the OCR results obtained on the collections of the French National Library. Our verification approach uses local information inside the pages, based on Radon transform descriptors and Local Binary Pattern (LBP) descriptors, coupled with the OCR results to check their consistency. The experimental results show that our method detects 84.15% of the missed textual components when comparing the OCR ALTO output files (produced by the service providers) with the document images.
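As an illustration of the kinds of descriptors mentioned above, the following sketch computes Radon-profile and LBP-histogram features for a grayscale page region with scikit-image. The function name and parameter values are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from skimage.transform import radon
from skimage.feature import local_binary_pattern

def region_descriptors(region, angles=np.arange(0, 180, 5),
                       lbp_points=8, lbp_radius=1):
    """Radon- and LBP-based features for a grayscale page region (sketch)."""
    # Radon transform: project the region along a set of angles.
    # Text lines produce strongly peaked projection profiles, so the
    # per-angle spread of the sinogram is informative about text presence.
    sinogram = radon(region, theta=angles, circle=False)
    radon_feat = sinogram.std(axis=0)  # one value per projection angle

    # Local Binary Patterns: encode local texture; text and background
    # regions yield distinct uniform-pattern histograms.
    lbp = local_binary_pattern(region, lbp_points, lbp_radius, method="uniform")
    hist, _ = np.histogram(lbp, bins=lbp_points + 2,
                           range=(0, lbp_points + 2), density=True)

    return np.concatenate([radon_feat, hist])
```

A descriptor like this can then be compared against the OCR output for the same region: a region whose features look strongly text-like but that contains no ALTO text element is a candidate missed component.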
Similar resources
Measuring Search Retrieval Accuracy of Uncorrected OCR: Findings from the Harvard-Radcliffe Online Historical Reference Shelf Digitization Project
This report presents the findings of an investigation to evaluate the conditions for search retrieval successes and failures when using uncorrected OCR for indexing. The purpose of the study was to assess whether low-cost, high-production techniques for text conversion were adequate to produce digital reproductions of consistent quality and usability. We sought to identify attributes of the ori...
Automatic Assessment of OCR Quality in Historical Documents
Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border/marginal noise, bleed-through, skewing, warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspon...
Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)
Document images produced by scanner or digital camera, usually suffer from geometric and photometric distortions. Both of them deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...
Non-interactive OCR Post-correction for Giga-Scale Digitization Projects
This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up or ticcl (pronounce ’tickle’) focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within the predefined Leven...
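The core idea of the snippet above, gathering typographical variants that lie within a bounded edit distance of a high-frequency focus word, can be sketched as follows. This is a minimal illustration of the principle, not the ticcl implementation; `gather_variants` and its parameters are assumed names:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def gather_variants(focus, vocabulary, max_dist=2):
    # Collect corpus words within the edit-distance bound of the focus word.
    return [w for w in vocabulary if 0 < levenshtein(focus, w) <= max_dist]
```

In a corpus-cleaning setting, the gathered variants for a frequent focus word are candidate OCR misrecognitions of it and can be mapped back to the focus form.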
Skew Detection Using the Radon Transform
In an automatic document conversion system, which builds digital documents from scanned articles, there is the need to perform various adjustments before the scanned image is fed to the OCR system. This is because the OCR system is prone to error when the text is not properly identified, aligned, de-noised, etc. Such an adjustment is the detection of page skew, an unintentional rotation of the ...
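A common way to implement Radon-based skew detection, sketched here under the assumption that horizontal text lines dominate the page, is to search a small angular window for the projection angle that maximizes the variance of the projection profile:

```python
import numpy as np
from skimage.transform import radon

def estimate_skew(image, angle_range=5.0, step=0.1):
    """Estimate page skew (in degrees) by projection-profile variance.

    When the projection direction matches the true text-line direction,
    rows of text produce sharply peaked profiles, i.e. maximal variance.
    For horizontal text, that peak occurs near theta = 90 degrees.
    """
    angles = np.arange(90 - angle_range, 90 + angle_range, step)
    sinogram = radon(image, theta=angles, circle=False)
    variances = sinogram.var(axis=0)      # one variance per tested angle
    return angles[np.argmax(variances)] - 90  # offset from horizontal
```

Searching only a narrow window around the expected orientation keeps the method fast; once the skew is known, the page can be counter-rotated before it is fed to the OCR engine.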